Abstract 

A new system to recognize protein-coding genes in coronavirus genomes, especially suitable for the SARS-CoV genomes, is 
proposed in this paper. Compared with some existing systems, the new program package has the merits of simplicity, high 
accuracy, reliability, and speed. The system, ZCURVE_CoV, has been run on each of the 11 newly sequenced SARS-CoV 
genomes. Consequently, six genomes not annotated previously have been annotated, and some problems of previous annotations in 
the remaining five genomes have been pointed out and discussed. In addition to the polyprotein chain ORFs 1a and 1b and the four 
genes coding for the major structural proteins, spike (S), small envelope (E), membrane (M), and nucleocapsid (N), respectively, 
ZCURVE_CoV also predicts 5-6 putative proteins between 39 and 274 amino acids in length with unknown functions. Some single 
nucleotide mutations within these putative coding sequences have been detected and their biological implications are discussed. A 
web service is provided, by which a user can obtain the annotated result immediately by pasting a SARS-CoV genome sequence 
into the input window on the web site (http://tubic.tju.edu.cn/sars/). The software ZCURVE_CoV can also be downloaded freely 
from the web address mentioned above and run on computers under Windows or Linux. 

© 2003 Elsevier Inc. All rights reserved. 
An outbreak of a life-threatening disease, referred to 
as severe acute respiratory syndrome (SARS), has 
spread to many countries around the world [1-6]. By 
late May 2003, the World Health Organization (WHO) 
had recorded more than 7000 cases of SARS and more 
than 600 SARS-related deaths, and a global alert for 
the illness was issued due to the severity of the 
disease (http://www.who.int/csr/sars/en/). 

A growing body of evidence has convincingly shown 
that SARS is caused by a novel coronavirus, called 
SARS-coronavirus or SARS-CoV. Currently, the complete 
genomes of 11 strains of SARS-CoV isolated from 
SARS patients have been sequenced [7-9], and more 
complete genome sequences of SARS-CoV are expected 
to come. 

The SARS-CoV genomes are about 30 kb in length. 
For such short genome sequences, there is currently no 
reliable software for the identification of protein-coding 
genes. Therefore, most sequenced genomes were annotated 
manually or not annotated at all. Among the 11 complete 
sequences, six have not yet been annotated and the 
remaining five were annotated manually. 

Currently, most algorithms for gene identification in 
prokaryotic genomes, such as GeneMark.hmm [10] and 
Glimmer [11], are based either on a higher-order 
Markov chain model or on a hidden Markov model, 
in which thousands of parameters need to be trained. 
The large number of parameters may result in less 
adaptability, especially for small genomes. In contrast, 
ZCURVE [12] is a newly developed system for gene 
recognition in bacterial and archaeal genomes, in which 
only 33 parameters are used and the recognition accuracy 
is high. In other words, the ZCURVE algorithm captures 
the essential coding properties of protein-coding genes 
with a relatively small number of parameters. Thus, it is 
suitable not only for large genomes but especially for 
small genomes. 

In this paper, we describe a system, called ZCURVE_CoV, 
based on a coronavirus-specific ZCURVE algorithm, 
which is especially suitable for gene recognition 
in SARS-CoV genomes. The system has the advantages 
of simplicity, reliability, high accuracy, and 
speed. The software system ZCURVE_CoV is freely 
available at http://tubic.tju.edu.cn/sars/. 

Materials and methods 

Six genome sequences of coronaviruses and the annotation information 
were downloaded from the web site of the NCBI RefSeq project 
(http://www.ncbi.nih.gov/RefSeq/). These coronaviruses include avian 
infectious bronchitis virus (NC_001451), bovine coronavirus 
(NC_003045), human coronavirus 229E (NC_002645), murine hepatitis 
virus (NC_001846), porcine epidemic diarrhea virus (NC_003436), 
and transmissible gastroenteritis virus (NC_002306). A total of 48 
genes were extracted from the above six genomes and used to train the 
gene-finding algorithm. Currently, 15 genome sequences of SARS 
coronavirus (SARS-CoV) strains are available in the GenBank database, 
of which 11 are complete and four are partial genomes. 
The former include SARS-CoV TOR2 (Accession No. 
AY274119), Urbani (AY278741), HKU-39849 (AY278491), CUHK-W1 
(AY278554), BJ01 (AY278488), CUHK-Su10 (AY282752), 
SIN2500 (AY283794), SIN2748 (AY283797), SIN2679 (AY283796), 
SIN2774 (AY283798), and SIN2677 (AY283795), whereas the latter 
include SARS-CoV BJ02 (AY278487), BJ03 (AY278490), BJ04 
(AY279354), and GZ01 (AY278489). 

The gene-finding algorithm presented in this paper is based on the 
Z curve [13], which is a graphic representation of DNA sequences. The 
Z curve method has been used to recognize protein-coding genes in the 
budding yeast genome [14]. A new ab initio gene-finding system for 
bacterial and archaeal genomes has been developed recently, based on 
the Z curve method [12]. Here the method with some modifications is 
used to recognize protein-coding genes in coronavirus genomes, which 
is presented briefly as follows. Suppose that the occurrence frequencies 
of the bases A, C, G, and T (U) at the first, second, and third codon 
positions in an ORF are denoted by $a_i$, $c_i$, $g_i$, and $t_i$, respectively, 
where $i = 1, 2, 3$. The four numbers, $a_i$, $c_i$, $g_i$, and $t_i$, are mapped onto 
a point in a 3-dimensional space $V_i$, with the coordinates 

$$
\begin{aligned}
x_i &= (a_i + g_i) - (c_i + t_i), \\
y_i &= (a_i + c_i) - (g_i + t_i), \qquad i = 1, 2, 3, \\
z_i &= (a_i + t_i) - (g_i + c_i).
\end{aligned}
\tag{1}
$$

Then, each ORF may be represented by a point or a vector in a 
9-dimensional space $V$, where $V = V_1 \oplus V_2 \oplus V_3$ and the symbol $\oplus$ 
denotes the direct sum of subspaces. The nine components $u_1$-$u_9$ 
of the space $V$ are defined as follows: 

$$
\begin{cases}
u_1 = x_1, \quad u_2 = y_1, \quad u_3 = z_1, \\
u_4 = x_2, \quad u_5 = y_2, \quad u_6 = z_2, \\
u_7 = x_3, \quad u_8 = y_3, \quad u_9 = z_3.
\end{cases}
\tag{2}
$$
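
The mapping in Eqs. (1) and (2) can be illustrated with a short Python sketch. It is not the ZCURVE_CoV source code; the function name `zcurve_features` is hypothetical, and the sketch assumes that the frequencies at each codon position are normalized so that $a_i + c_i + g_i + t_i = 1$.

```python
from collections import Counter


def zcurve_features(orf: str) -> list[float]:
    """Return the nine components u1..u9 of Eq. (2) for one ORF.

    Assumption: frequencies are normalized per codon position.
    """
    seq = orf.upper().replace("U", "T")  # accept RNA or DNA letters
    features = []
    for phase in range(3):  # codon positions 1, 2, 3
        bases = seq[phase::3]
        counts = Counter(bases)
        n = len(bases) or 1  # avoid division by zero for empty input
        a, c, g, t = (counts[b] / n for b in "ACGT")
        features += [
            (a + g) - (c + t),  # x_i
            (a + c) - (g + t),  # y_i
            (a + t) - (g + c),  # z_i
        ]
    return features


# Toy ORF: A at every first position, T at every second, G at every third.
print(zcurve_features("ATGATGATG"))
```

Each ORF, whatever its length, is reduced to the same nine numbers, which is what keeps the number of trainable parameters small.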

To train the system, two sets of samples are needed: 
positive samples corresponding to protein-coding genes (seed ORFs) 
and negative samples corresponding to non-coding sequences. In the Z 
curve method, gene recognition is essentially based on the compositional 
asymmetry of the three codon positions in coding sequences. It 
was shown that the overall extent of codon usage bias in RNA viruses 
is low and there is little variation in bias between genes [15]. 
Coronaviruses belong to the Coronaviridae, and the G + C content of the 
published coronavirus genomes ranges from 37% to 42% [7]. Therefore, 
it is reasonable to deduce that the published coronavirus genomes 
have similar codon usage. Based on this consideration, gene-finding 
parameters derived from some published coronavirus genomes may be 
applicable to recognizing genes in other coronavirus genomes. 
Because the SARS-CoV genomes are relatively small (≈30 kb), 
it is difficult to obtain enough seed ORFs from a SARS-CoV genome itself. 
Therefore, we used other published coronavirus genomes to train the 
gene-finding parameters: the genomes of avian infectious 
bronchitis virus, bovine coronavirus, human coronavirus 229E, 
murine hepatitis virus, porcine epidemic diarrhea virus, and 
transmissible gastroenteritis virus, from which 48 seed 
ORFs were selected. Detailed information about the 48 seed ORFs 
is given in Table 1 of the supplementary materials (see 
http://tubic.tju.edu.cn/sars/). 

Below we describe the strategy to produce the negative samples. It 
is rather difficult to produce an appropriate set of non-coding 
sequences from coronavirus genomes, because the amount of non-coding 
sequence in these genomes is too small to be used. A 
method to produce negative samples has been developed previously 
and has been shown to be an effective way to solve the problem [12]. 
The same method is used in the current study. In this method, a 
negative sample is derived from a seed ORF. Generally speaking, 
if the regular structure of a coding sequence is completely destroyed, it 
is transformed into a non-coding one. Therefore, a negative sample 
may be obtained simply by shuffling the corresponding coding sequence 
sufficiently (20,000 times in the current study). The resulting random 
sequences from all 48 seed ORFs were used as non-coding 
sequences. The major difference between the two is that a seed ORF has 
some regular structure, whereas its shuffled counterpart is a random 
sequence. Strictly speaking, a random sequence is not a non-coding 
sequence, but it is a good approximation. 
As shown below, this approximation generally results in good 
gene-finding results. 
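
A minimal sketch of this shuffling step is given below. The text does not specify the shuffling procedure in detail, so the sketch assumes that "shuffling 20,000 times" means 20,000 random pairwise base swaps; the function name `shuffled_negative` and the `seed` parameter are hypothetical.

```python
import random


def shuffled_negative(orf: str, swaps: int = 20000, seed=None) -> str:
    """Derive a negative sample from a seed ORF by random base swaps.

    Assumption: one "shuffle" is one swap of two randomly chosen positions.
    Base composition is preserved; codon structure is destroyed.
    """
    rng = random.Random(seed)
    bases = list(orf)
    n = len(bases)
    if n < 2:
        return orf
    for _ in range(swaps):
        i, j = rng.randrange(n), rng.randrange(n)
        bases[i], bases[j] = bases[j], bases[i]
    return "".join(bases)


seed_orf = "ATGGATTTGTTTATGAGATTTTTTACTCTTGGATCAATTACTGCATAA"
negative = shuffled_negative(seed_orf, seed=1)
print(len(negative) == len(seed_orf))
```

Because only positions are exchanged, the negative sample has exactly the base composition of its seed ORF, so the classifier must rely on the codon-position asymmetry rather than on overall composition.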

The Fisher linear equation for discriminating the positive and 
negative samples in the 9-dimensional space $V$ represents a hyperplane, 
described by a vector $\mathbf{c}$ with nine components $c_1, c_2, \ldots, c_9$. 
For more details about the Fisher discriminant algorithm, refer to, 
for example, [14]. Based on the data in the training set (including the 
positive and negative samples), the vector $\mathbf{c}$ and the threshold $c_0$ are 
obtained. The decision of coding/non-coding for each ORF and negative 
sample is simply made by the criterion $\mathbf{c} \cdot \mathbf{u} > c_0$ / $\mathbf{c} \cdot \mathbf{u} < c_0$, 
where $\mathbf{c} = (c_1, c_2, \ldots, c_9)^T$, $\mathbf{u} = (u_1, u_2, \ldots, u_9)^T$, and "T" indicates the 
transpose of a matrix. This criterion can be rewritten as 
$Z(\mathbf{u}) > 0$ / $Z(\mathbf{u}) < 0$, where $Z(\mathbf{u}) = \mathbf{c} \cdot \mathbf{u} - c_0$. 
$Z(\mathbf{u})$ is called the Z score or Z index 
for an ORF or a fragment of DNA sequence. Finally, the strategy to 
deal with overlapping ORFs used here is similar to that described in 
the previous paper [12]. 
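
The training and decision steps can be sketched in plain Python as follows. This is a textbook Fisher linear discriminant, not the ZCURVE_CoV implementation: the helper names are hypothetical, and the threshold $c_0$ is taken as the midpoint of the two projected class means, which is an assumption because the text does not state how $c_0$ is chosen. The sketch also assumes the pooled scatter matrix is non-singular.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def scatter(vectors, mu):
    """Within-class scatter matrix: sum of (v - mu)(v - mu)^T."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for v in vectors:
        diff = [x - m for x, m in zip(v, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fisher_train(positives, negatives):
    """Return (c, c0) with c = Sw^-1 (mu_pos - mu_neg)."""
    mu_p, mu_n = mean(positives), mean(negatives)
    sp, sn = scatter(positives, mu_p), scatter(negatives, mu_n)
    d = len(mu_p)
    sw = [[sp[i][j] + sn[i][j] for j in range(d)] for i in range(d)]
    c = solve(sw, [p - q for p, q in zip(mu_p, mu_n)])
    proj_p = sum(ci * x for ci, x in zip(c, mu_p))
    proj_n = sum(ci * x for ci, x in zip(c, mu_n))
    return c, (proj_p + proj_n) / 2  # assumed midpoint threshold


def z_score(c, c0, u):
    """Z(u) = c . u - c0; positive means 'coding'."""
    return sum(ci * ui for ci, ui in zip(c, u)) - c0


# Two-dimensional toy data standing in for the 9-D Z curve vectors.
pos = [[2.0, 1.0], [3.0, 2.0], [2.0, 2.0], [3.0, 1.0]]
neg = [[-2.0, 1.0], [-3.0, 2.0], [-2.0, 2.0], [-3.0, 1.0]]
c, c0 = fisher_train(pos, neg)
print(z_score(c, c0, [2.5, 1.5]) > 0)
```

With real data, `pos` would hold the nine-component vectors of the 48 seed ORFs and `neg` those of their shuffled counterparts.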


Results and discussions 

Comparison with the existing system — GeneMark.hmm 

No coronavirus-specific annotation system has 
been available so far. Currently, GeneMark.hmm is 
commonly used for gene-finding in virus genomes [10]. 
We submitted the SARS-CoV TOR2 genome to the 
GeneMark.hmm website using default settings, and the 
prediction result is listed in Table 1. It can be seen that the 
predicted 'gene 1' is questionable, because of its short 
length and the lack of a start codon. An important 
structural protein gene (small envelope protein E), 
which is located from 26117 to 26347, was not predicted 
by GeneMark.hmm. Moreover, we submitted the same 
genome sequence several times to the website; however, 
the prediction results were not always identical, 
indicating that the system is unstable. An important 
structural protein gene (N protein), which is located 
from 28120 to 29388, was predicted as 'gene 10' and 
'gene 11' (marked with * in Table 1) in some predicted 
results. Sometimes, 'gene 9' (marked with * in Table 1), 
a quite conserved ORF in all of the 11 SARS-CoV 
genomes mentioned above, was not predicted. Compared 
with GeneMark.hmm for gene-finding in the SARS-CoV 
genomes, the performance of ZCURVE_CoV is 
better (see Table 3 in the supplementary materials). 

Table 1 

The genes predicted by GeneMark.hmm for the SARS-CoV, TOR2 strain 

Gene   Start    Stop     Gene length (bp) 
1      <3       53       51 
2      265      13,413   13,149 
3      13,599   21,485   7887 
4      21,492   25,259   3768 
5      25,268   26,092   825 
6      26,398   27,063   666 
7      27,074   27,265   192 
8      27,273   27,641   369 
9*     27,864   28,118   255 
10*    28,130   28,426   297 
11*    28,423   29,388   966 

The same genome was submitted several times to the website; 
however, the prediction results were not always identical, 
indicating that the system is unstable. An important structural protein 
gene (N protein), which is located from 28120 to 29388, was predicted 
as 'gene 10' and 'gene 11' in some predicted results. Sometimes, 'gene 9', 
a quite conserved ORF in all of the 11 SARS-CoV genomes mentioned 
above, was not predicted. In addition, the gene coding for a 
structural protein (small envelope protein E) was also missed by the 
prediction. For more details, see the text. 

Apply ZCURVE_CoV to analyze the SARS-CoV genomes 

Currently, the genome sequences of 15 SARS-CoV 
strains are available in the GenBank/EMBL databases, of 
which 11 are complete and four are partial 
genomes. The gene-finding software ZCURVE_CoV 
Version 1.0 has been run on each of the 11 complete 
SARS-CoV genomes. To save space, the detailed results 
are listed in Table 3 of the supplementary materials (see 
also the discussion below). In addition to the polyprotein 
chain ORFs 1a and 1b, the program predicts four 
genes coding for the four major structural 
proteins, i.e., spike (S), small envelope (E), membrane 
(M), and nucleocapsid (N), respectively, in all of the 11 
SARS-CoV genomes. Additionally, ZCURVE_CoV 1.0 
predicts 5-6 putative proteins with lengths between 
39 and 274 amino acids for the 11 genomes. These 
putative genes might code for non-structural proteins in 
the SARS-CoV genomes. 

To compare the gene-finding result of the system 
ZCURVE_CoV 1.0 with a known annotation, the 
SARS-CoV TOR2 strain is used as an example. The 
genome of the TOR2 strain was annotated manually [8]; 
this annotation is listed in the left part of Table 2, 
whereas the annotation produced by ZCURVE_CoV 1.0 is 
listed in the right part of Table 2. As can be seen, the two 
annotations are in good agreement with each other, 
except for three ORFs. The three ORFs, i.e., ORF 4, 
ORF 13, and ORF 14 annotated by Marra et al. [8], are 
not predicted by ZCURVE_CoV 1.0. These ORFs are 
completely embedded, with a frameshift, within the 
genes coding for some structural proteins. The absence 
of transcription regulating sequences (TRSs) at the 
5' end of these ORFs [8] suggests that they are unlikely 
to be protein-coding genes. The principal component 
analysis performed below further supports this 
conjecture. As mentioned in the Materials and methods 
section, each ORF is represented by a point in a 
9-dimensional (9-D) space. Consequently, the positive 
samples (genes) and negative samples (non-coding 
sequences) are represented by two groups of points in the 
9-D space, respectively. For the TOR2 strain, the 12 
putative genes predicted by ZCURVE_CoV and ORF 4, 
ORF 13, and ORF 14 are likewise represented by the 
corresponding points in the 9-D space. We 
project the points in the 9-D space onto the 3-D space 
spanned by the first, second, and third principal axes 
obtained from the principal component analysis. The 
first three principal components account for 
about 70% of the total inertia of the 9-D space. Fig. 1 
shows the distribution of the corresponding points in the 
3-D space, where green and orange balls represent the 
positive samples (genes) and negative samples 
(non-coding sequences), respectively. Blue balls correspond to 
the genes predicted by ZCURVE_CoV for the TOR2 
strain, while red balls correspond to ORF 4, ORF 13, 
and ORF 14 annotated by Marra et al. [8]. It is clear 
that the three red balls are located on the side of the 
non-coding sequences, indicating that ORF 4, ORF 13, and 
ORF 14 are very unlikely to code for proteins. 
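
The comparison between the two halves of Table 2 amounts to matching ORFs by their coordinates. The sketch below shows this with the TOR2 coordinates transcribed from Table 2; the function name `unmatched` and the choice of (start, stop) pairs as matching keys are illustrative assumptions, not part of ZCURVE_CoV.

```python
# (start, stop) -> feature, transcribed from the left half of Table 2.
annotated = {
    (265, 13398): "ORF 1a", (13398, 21485): "ORF 1b",
    (21492, 25259): "S protein", (25268, 26092): "ORF 3",
    (25689, 26153): "ORF 4", (26117, 26347): "E protein",
    (26398, 27063): "M protein", (27074, 27265): "ORF 7",
    (27273, 27641): "ORF 8", (27638, 27772): "ORF 9",
    (27779, 27898): "ORF 10", (27864, 28118): "ORF 11",
    (28120, 29388): "N protein", (28130, 28426): "ORF 13",
    (28583, 28795): "ORF 14",
}
# Right half of Table 2: every annotated ORF except ORF 4, 13, and 14.
not_predicted = {"ORF 4", "ORF 13", "ORF 14"}
predicted = {k for k, name in annotated.items() if name not in not_predicted}


def unmatched(reference, candidates):
    """Features in `reference` whose coordinates are absent from `candidates`."""
    return [reference[k] for k in sorted(reference) if k not in candidates]


def orf_length(start, stop):
    """Length in bp of an ORF given inclusive 1-based coordinates."""
    return stop - start + 1


print(unmatched(annotated, predicted))
print(orf_length(26117, 26347))  # E protein
```

The length helper reproduces the "bp" column of Table 2 from the start and stop coordinates.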

A similar analysis was performed for the Urbani strain 
[7]. The result is listed in Table 3, in which the putative 
gene X2 annotated by Rota et al. [7], corresponding to 
ORF 4 in Marra et al. [8], is not predicted by ZCURVE_CoV. 
Based on the above analysis, X2 is also very 
unlikely to code for a protein. Of the 11 complete 
SARS-CoV genomes, six have not yet been annotated. 
We have run the program ZCURVE_CoV on each of 
the 11 genomes. Consequently, those already annotated 
have been re-annotated and those not yet annotated 
have been annotated. All of the annotated results are 
listed in Table 3 of the supplementary materials. 

Table 2 

Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0, for the SARS-CoV, TOR2 strain 

Genes annotated                                 Genes predicted by ZCURVE_CoV 1.0 
Start    Stop     bp       a.a.   Feature       Start       Stop        bp       a.a.   Feature 
265      13,398   13,134   4377   ORF 1a        265         13,398(a)   13,134   4377   ORF 1a 
13,398   21,485   8088     2695   ORF 1b        13,398(a)   21,485      8088     2695   ORF 1b 
21,492   25,259   3768     1255   S protein     21,492      25,259      3768     1255   S protein 
25,268   26,092   825      274    ORF 3         25,268      26,092      825      274    Sars274 
25,689   26,153   465      154    ORF 4         -           -           -        -      - 
26,117   26,347   231      76     E protein     26,117      26,347      231      76     E protein 
26,398   27,063   666      221    M protein     26,398      27,063      666      221    M protein 
27,074   27,265   192      63     ORF 7         27,074      27,265      192      63     Sars63 
27,273   27,641   369      122    ORF 8         27,273      27,641      369      122    Sars122 
27,638   27,772   135      44     ORF 9         27,638      27,772      135      44     Sars44 
27,779   27,898   120      39     ORF 10        27,779      27,898      120      39     Sars39 
27,864   28,118   255      84     ORF 11        27,864      28,118      255      84     Sars84 
28,120   29,388   1269     422    N protein     28,120      29,388      1269     422    N protein 
28,130   28,426   297      98     ORF 13        -           -           -        -      - 
28,583   28,795   213      70     ORF 14        -           -           -        -      - 

(a) The program ZCURVE_CoV 1.0 has two options. The default option is to use the heptamer UUUAAAC as the conserved 'slippery sequence' 
to find the coronavirus -1 frameshift site [16]. Once the heptamer is found in the upstream sequence near the originally predicted 
ending site of ORF 1a, the ending site of ORF 1a and the starting site of ORF 1b are both corrected to the frameshift site (13398 in this genome) according to this 
'slippery sequence'. Otherwise, if this heptamer cannot be found, only the original sites predicted for ORF 1a and ORF 1b are displayed in the output 
file. The second option is to ignore the -1 frameshift, in which case the original sites predicted for ORF 1a and ORF 1b are always displayed, regardless of the 
existence of the heptamer UUUAAAC. 

Table 3 

Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0, for the SARS-CoV, Urbani strain 

Genes annotated                                 Genes predicted by ZCURVE_CoV 1.0 
Start    Stop     bp       a.a.   Feature       Start       Stop        bp       a.a.   Feature 
265      13,398   13,134   4377   ORF 1a        265         13,398(a)   13,134   4377   ORF 1a 
13,398   21,485   8088     2695   ORF 1b        13,398(a)   21,485      8088     2695   ORF 1b 
21,492   25,259   3768     1255   S protein     21,492      25,259      3768     1255   S protein 
25,268   26,092   825      274    X1            25,268      26,092      825      274    Sars274 
25,689   26,153   465      154    X2            -           -           -        -      - 
26,117   26,347   231      76     E protein     26,117      26,347      231      76     E protein 
26,398   27,063   666      221    M protein     26,398      27,063      666      221    M protein 
27,074   27,265   192      63     X3            27,074      27,265      192      63     Sars63 
27,273   27,641   369      122    X4            27,273      27,641      369      122    Sars122 
-        -        -        -      -             27,638      27,772      135      44     Sars44 
-        -        -        -      -             27,779      27,898      120      39     Sars39 
27,864   28,118   255      84     X5            27,864      28,118      255      84     Sars84 
28,120   29,388   1269     422    N protein     28,120      29,388      1269     422    N protein 

(a) See the footnote in Table 2. 
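
The default option described in the footnote to Table 2 (searching for the slippery heptamer upstream of the predicted end of ORF 1a) can be sketched as below. This is only an illustration: the function name `frameshift_site`, the 100-nt search window, and the convention that the reported site is the last base of the heptamer are all assumptions, since the text does not give these details.

```python
SLIPPERY = "TTTAAAC"  # DNA form of the heptamer UUUAAAC


def frameshift_site(genome: str, orf1a_stop: int, window: int = 100):
    """Return the assumed -1 frameshift site, or None if no heptamer is found.

    `orf1a_stop` is the 1-based end of the originally predicted ORF 1a.
    The nearest heptamer lying entirely within `window` bases upstream of
    that end is used; the site is reported as the 1-based position of the
    heptamer's last base (an assumption of this sketch).
    """
    seq = genome.upper().replace("U", "T")
    lo = max(0, orf1a_stop - window)
    idx = seq.rfind(SLIPPERY, lo, orf1a_stop)  # 0-based start, or -1
    if idx == -1:
        return None  # keep the originally predicted sites
    return idx + len(SLIPPERY)


# Synthetic example: heptamer occupies 1-based positions 51..57.
toy = "A" * 50 + "TTTAAAC" + "G" * 20
print(frameshift_site(toy, orf1a_stop=70))
```

When the function returns a position, both the end of ORF 1a and the start of ORF 1b would be set to it, mirroring the shared coordinate 13,398 in Tables 2 and 3; when it returns `None`, the original predictions are kept.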

Analyze the mutations of the six putative non-structural 
genes by sequence alignment 

To examine nucleotide mutations in the predicted 
genes coding for non-structural proteins, we aligned the 
coding sequences of Sars274, Sars63, Sars122, Sars44, 
Sars39, and Sars84, respectively, for the 11 complete 
SARS-CoV genomes using ClustalW 1.8 [17]. The results 
of the multiple sequence alignment for the above six 
predicted genes are listed 
in Fig. 1 of the supplementary materials. For the three 
ORFs Sars122, Sars44, and Sars84, the nucleotide 
sequences are all conserved in the 11 SARS-CoV genomes, 
indicating that the three ORFs might have crucial 
biological functions. Mutations in these gene sequences 
would presumably result in loss of important functions. Therefore, 
these coding sequences might serve as candidate 
targets for designing drugs against SARS. In contrast, 
Sars39 is not found in the strains SIN2677 and 
SIN2748, and a nucleotide mutation occurs at nucleotide 
position 49, leading to the mutation Cys → Arg in the 
strains BJ01 and CUHK-W1. The rapid mutations 
occurring in Sars39 imply that it is probably not a key 
protein for SARS-CoV. For Sars63, two nucleotide 
mutations are observed at base positions 38 and 170, 
leading to the amino acid mutations Glu → Gly and 
Pro → Leu in the strains SIN2677 and BJ01, respectively. 
See Fig. 1 in the supplementary materials for details. 

Fig. 1. Distribution of the mapping points corresponding to genes, non-genes, predicted genes, and questionable ORFs for the SARS-CoV, TOR2 
strain in a 3-dimensional (3-D) space. Each gene or ORF is mapped onto a point in a 9-D space. To visualize the distribution, the mapping points are 
projected onto the 3-D space spanned by the first three principal axes based on the principal component analysis. The first, second, and third 
principal vectors are denoted by the X-, Y-, and Z-axes, respectively. The first three principal components account for 69.59% of the 
total inertia of the 9-D space. Green and orange balls represent the positive samples (genes) and negative samples (non-coding sequences), 
respectively. Blue balls correspond to the genes predicted by ZCURVE_CoV for the TOR2 strain, while red balls correspond to ORF 4, ORF 13, and 
ORF 14 annotated by Marra et al. [8]. It is clear that the three red balls are situated on the side of the non-coding sequences, indicating that ORF 4, ORF 
13, and ORF 14 are very unlikely to code for proteins. 

The result of the ClustalW alignment for Sars274 is 
shown in Fig. 2. Four nucleotide mutations, located at 
positions 31, 302, 406, and 783, respectively, have been 
detected in three different strains. The last substitution 
is a synonymous codon mutation, which does not 
lead to an amino acid change. The point mutations 
at nucleotide positions 31, 302, and 406 
cause amino acid changes (Fig. 2). At the 31st position, 
G → A (TOR2) ⇒ Gly → Arg. Similarly, at the 302nd 
position, T (U) → A (HKU-39849) ⇒ Met → Lys; and at 
the 406th position, A → C (BJ01) ⇒ Lys → Gln. On the 
other hand, it was reported by Marra et al. [8] that there 
exist three trans-membrane regions spanning approximately 
nucleotide positions 102 to 168 (residues 34 to 56), 
231 to 297 (77 to 99), and 309 to 375 (103 to 125), 
respectively, in the Sars274 sequence. Therefore, the 
mutations occur outside of the predicted trans-membrane 
regions. Note that the second amino acid mutation 
(Met → Lys) is substantial, as reflected by the fact that Met 
is a hydrophobic amino acid, whereas 
Lys is a strongly hydrophilic one. At present, we cannot 
know whether these mutations cause severe conformational 
changes in the tertiary structure of this putative 
protein. The high mutation rate of Sars274 implies that 
either it might be a relatively unimportant protein for 
SARS-CoV, or the mutations do not dramatically change 
its biological function. Finally, for the time 
being we still cannot rule out the possibility that all or 
some of these mutations are caused by sequencing errors. 

            1                 31            302           406            783             825 
TOR2        ATGGATTTGT...CTCTTAGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
HKU-39849   ATGGATTTGT...CTCTTGGATCA...AGGTAAGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
Urbani      ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
SIN2748     ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
SIN2774     ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
SIN2679     ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
SIN2677     ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
SIN2500     ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
CUHK-Su10   ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
CUHK-W1     ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCAAGAACC...GATCCAATTTA...GCCTTTGTAA 
BJ01        ATGGATTTGT...CTCTTGGATCA...AGGTATGGAGG...AATCCCAGAACC...GATCCCATTTA...GCCTTTGTAA 

Mutations: 

Strain      Codon (nucleotide position)   Amino acid (residue) 
TOR2        GGA → AGA (31)                Gly → Arg (11) 
HKU-39849   ATG → AAG (302)               Met → Lys (101) 
BJ01        AAG → CAG (406)               Lys → Gln (135) 

Fig. 2. Nucleotide mutations of the predicted gene Sars274 based on the alignment of the corresponding coding sequences in 11 complete genome 
sequences. A total of four point mutations are detected, of which one is a silent mutation and the other three cause amino acid changes in the putative 
gene. The point mutations occur at nucleotide positions 31, 302, 406, and 783, respectively. At the 31st position, G → A (TOR2) ⇒ Gly → Arg. 
Similarly, at the 302nd position, T (U) → A (HKU-39849) ⇒ Met → Lys; at the 406th position, A → C (BJ01) ⇒ Lys → Gln; and at the 783rd position, 
A → C (BJ01), but no amino acid change. 
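
The translation of a point mutation into an amino acid change can be checked with a small Python sketch using the standard genetic code. The function name `codon_change` is hypothetical; the example codons GGA and ATG are those listed in Fig. 2, while the codon CCA for position 783 is read off the aligned fragment shown in Fig. 2 and is therefore an assumption.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_change(position: int, ref_codon: str, alt_base: str):
    """Return (residue number, reference aa, mutant aa).

    `position` is the 1-based nucleotide position within the coding
    sequence; `ref_codon` is the reference codon containing it.
    """
    residue = (position - 1) // 3 + 1
    offset = (position - 1) % 3  # 0, 1, or 2 within the codon
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    return residue, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


print(codon_change(31, "GGA", "A"))   # TOR2: Gly -> Arg
print(codon_change(302, "ATG", "A"))  # HKU-39849: Met -> Lys
print(codon_change(783, "CCA", "C"))  # BJ01: synonymous change
```

A mutation is synonymous when the two returned amino acids are identical, as for the substitution at position 783.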

Supplementary materials 

The detailed supplementary materials related to this 
study are available from the website 
http://tubic.tju.edu.cn/sars/, which includes the following content: 

(a) Table 1. The 48 seed ORFs and the six coronavirus 
genomes from which the seed ORFs are derived. 

(b) Table 2. The Fisher coefficients and threshold 
obtained from the seed ORFs. 

(c) Table 3. Results of gene-finding using ZCURVE_CoV 
for the 11 SARS-CoV complete genomes. 

(d) Fig. 1. The results of multiple sequence alignment 
of the six predicted genes coding for non-structural 
proteins, Sars274, Sars63, Sars122, Sars44, Sars39, and 
Sars84, respectively. 

Online service and availability of the program ZCURVE_CoV 

A web interface of the ZCURVE_CoV system has 
been constructed. When a user pastes a SARS-CoV 
genome sequence into the input window of the website, 
the gene-finding result is returned to the user 
immediately. A user may also download the executable 
version of the program ZCURVE_CoV and run it on 
computers under Windows 
(95/98/NT/Me/2000 or higher), Linux (Redhat 7.1 or 
higher), or SGI IRIX 6.5. For more detailed information, 
visit: http://tubic.tju.edu.cn/sars/. 


Conclusion 

Severe acute respiratory syndrome (SARS) is an 
extremely severe disease that has spread to many countries 
around the world. Accumulating evidence has shown 
that SARS is caused by a new coronavirus, i.e., SARS-CoV. 
A new system to recognize protein-coding genes in 
SARS-CoV genomes, called ZCURVE_CoV, has been 
reported in this paper. By applying the program to 11 
complete SARS-CoV genomes, six genomes not annotated 
previously have been annotated, and some problems 
of previous annotations in the remaining five 
genomes have been pointed out and discussed. It is 
shown that three ORFs annotated as protein-coding by 
Marra et al. [8], i.e., ORF 4, ORF 13, and ORF 14, are 
very unlikely to code for proteins. In addition to 
ORF 1a, ORF 1b, and the four genes coding for the 
major structural proteins S, E, M, and N, the new system 
ZCURVE_CoV also predicts 5-6 putative genes 
coding for non-structural proteins. By aligning each of the 
non-structural gene sequences across the 11 complete 
genomes, some mutations have been detected. The 
biological implications of the mutations have been 
discussed. 